DSCI Capstone Project Prep
Overview
Background
- 1 in 9 Americans older than 65 has Alzheimer’s dementia
- Treatment in the United States cost $305 billion in 2020
- 12.5 million projected to have Alzheimer’s dementia by 2050
- Disease still not fully understood
- Alzheimer’s has certain gene expressions: https://www.nia.nih.gov/news/gene-expression-signatures-alzheimers-disease
Data
I am using the micro array gene expression and diagnosis data from the Alzheimer’s Disease NeuroImaging Initiative. http://adni.loni.usc.edu/
The data contains Alzheimer’s diagnoses based on MRI scans, PET scans, and professional evaluations.
The data tables that I am using are:
ADNIMERGE: Contains basic patient information and their diagnoses.
ADNI_Gene_Expression_Profile: Contains gene expression data from blood samples.
Objective
I plan on using the blood gene expression data to build a predictive model for the detection of Alzheimer’s disease. I also intend to do some exploratory analysis to see which gene expressions are associated with Alzheimer’s.
Data Prep
Imports and Markdown Setup
R setup
library(tidyverse)
library(reticulate)
library(tblhelpr)
Sys.setenv(RETICULATE_PYTHON = "/Users/liamf/miniconda3/envs/dsci2_py38/python.exe")
use_condaenv("dsci2_py38", required = TRUE)
Python Setup
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
import matplotlib.pyplot as plt
import seaborn as sns
Reading and Processing Diagnosis Data
# Some of this code from previous work.
# Pipe for wrangling Diagnosis data.
diagnosis <- read_csv(
"/Users/liamf/OneDrive/Documents/DataSci2/Capstone Project/data/ADNIMERGE.csv"
) %>%
mutate(RID = as.integer(RID)) %>%
mutate(study_enrolled = case_when(RID < 2000 ~ "ADNI1",
RID >= 2000 & RID < 4000 ~ "ADNIGO",
RID >= 4000 & RID < 6000 ~ "ADNI2",
RID >= 6000 ~ "ADNI3")
) %>%
mutate(baseline_diagnosis = recode(DX_bl,
EMCI = "MCI",
LMCI = "MCI",
SMC = "CN")) %>%
rename(VISCODE2 = VISCODE) %>%
filter(VISCODE2 == "bl") %>%
select(RID, baseline_diagnosis, study_enrolled, VISCODE2)
diagnosis
## # A tibble: 2,380 x 4
## RID baseline_diagnosis study_enrolled VISCODE2
## <int> <chr> <chr> <chr>
## 1 2 CN ADNI1 bl
## 2 3 AD ADNI1 bl
## 3 4 MCI ADNI1 bl
## 4 5 CN ADNI1 bl
## 5 6 MCI ADNI1 bl
## 6 7 AD ADNI1 bl
## 7 10 AD ADNI1 bl
## 8 14 CN ADNI1 bl
## 9 15 CN ADNI1 bl
## 10 16 CN ADNI1 bl
## # ... with 2,370 more rows
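The RID-to-study mapping applied in the pipe above can be sketched as a small Python helper; the range boundaries come from the case_when, and the function name is illustrative:

```python
def study_enrolled(rid: int) -> str:
    # RID ranges correspond to ADNI enrollment waves (per the case_when above).
    if rid < 2000:
        return "ADNI1"
    elif rid < 4000:
        return "ADNIGO"
    elif rid < 6000:
        return "ADNI2"
    else:
        return "ADNI3"

print(study_enrolled(2))  # a low RID falls in ADNI1
```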
Missing Data
# Viewing the number of missing values in each column.
diagnosis %>%
summarise(across(everything(), ~ sum(is.na(.x)))) %>%
rowid_to_column() %>%
transpose_tibble(rowid)
## # A tibble: 4 x 2
## columns `1`
## <chr> <int>
## 1 RID 0
## 2 baseline_diagnosis 28
## 3 study_enrolled 0
## 4 VISCODE2 0
The 28 baseline diagnoses are missing in the original data as well, not dropped by the wrangling.
RID (patient) Key
max(
diagnosis %>%
group_by(RID) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
select(count)
)
## [1] 1
Each RID (patient ID) occurs at most once, so RID correctly serves as a key.
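The same key check is a one-liner once the table is in pandas; a minimal sketch on a toy stand-in frame (the values here are illustrative):

```python
import pandas as pd

# Toy stand-in for the diagnosis table; Series.is_unique is equivalent to
# checking that the maximum per-RID group count equals 1.
diagnosis = pd.DataFrame({"RID": [2, 3, 4, 5],
                          "baseline_diagnosis": ["CN", "AD", "MCI", "CN"]})
assert diagnosis["RID"].is_unique
print(diagnosis["RID"].duplicated().sum())  # 0 duplicate RIDs
```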
Reading and Processing Genetic Data
# Data Wrangling pipe for genetic data.
ADNI1GO2_genes <- read_csv("/Users/liamf/OneDrive/Documents/DataSci2/Capstone Project/data/ADNI_Gene_Expression_Profile.csv") %>%
select(-"...2", -"...3") %>%
transpose_tibble(Phase, id_col = "former_columns") %>%
mutate(RID = str_replace(SubjectID, ".+?(_S_)", "")) %>%
mutate(RID = as.integer(RID)) %>%
select(RID, everything()) %>%
select(-Visit, -former_columns, -SubjectID,
-"260/280", -"260/230", -"RIN", -"Affy Plate",
-"YearofCollection", -"ProbeSet") %>%
drop_na()
ADNI1GO2_genes
## # A tibble: 744 x 49,387
## RID `11715100_at` `11715101_s_at` `11715102_x_at` `11715103_x_at`
## <int> <chr> <chr> <chr> <chr>
## 1 1249 2.237 2.624 1.873 2.92
## 2 4410 2.294 2.416 1.884 2.668
## 3 4153 2.14 2.322 1.999 3.634
## 4 1232 2.062 2.5 1.851 3.632
## 5 4205 2.04 2.395 2.08 3.278
## 6 4467 2.439 2.309 1.997 3.578
## 7 205 1.955 2.451 1.539 3.362
## 8 2374 2.372 2.403 1.926 3.371
## 9 4491 2.327 2.738 1.922 4.124
## 10 4059 2.535 2.478 2.073 3.824
## # ... with 734 more rows, and 49,382 more variables: `11715104_s_at` <chr>,
## # `11715105_at` <chr>, `11715106_x_at` <chr>, `11715107_s_at` <chr>,
## # `11715108_x_at` <chr>, `11715109_at` <chr>, `11715110_at` <chr>,
## # `11715111_s_at` <chr>, `11715112_at` <chr>, `11715113_x_at` <chr>,
## # `11715114_x_at` <chr>, `11715115_s_at` <chr>, `11715116_s_at` <chr>,
## # `11715117_x_at` <chr>, `11715118_s_at` <chr>, `11715119_s_at` <chr>,
## # `11715120_s_at` <chr>, `11715121_s_at` <chr>, `11715122_at` <chr>, ...
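The RID is parsed out of SubjectID by stripping everything up to and including the `_S_` separator. The equivalent extraction in Python (the sample SubjectID below is hypothetical, formatted in the ADNI style) might look like:

```python
import re

subject_id = "941_S_1249"  # hypothetical ADNI-style SubjectID
# Non-greedy match up to and including "_S_", replaced once, mirroring
# str_replace(SubjectID, ".+?(_S_)", "") in the R pipe above.
rid = int(re.sub(r".+?_S_", "", subject_id, count=1))
print(rid)  # 1249
```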
Key Check
max(
ADNI1GO2_genes %>%
group_by(RID) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
select(count)
)
## [1] 1
Again, each RID (patient ID) occurs at most once, so RID correctly serves as a key.
Writing Prepped Data to CSV’s
write_csv(ADNI1GO2_genes, "/Users/liamf/OneDrive/Documents/DataSci2/Capstone Project/data/genes_prepped.csv")
Joining Diagnosis Table to Gene Expression and Writing to CSV
data_prepped <- diagnosis %>%
inner_join(ADNI1GO2_genes, by = "RID")
write_csv(data_prepped, "/Users/liamf/OneDrive/Documents/DataSci2/Capstone Project/data/data_prepped.csv")
Summary
The resulting table after the data wrangling has the following characteristics:
N Row: 744, corresponding to 744 patients.
N Col: 49,390; most of these columns are blood gene expression values for individual probes.
Classes: There are three target classes: CN (cognitively normal), MCI (mild cognitive impairment), AD (Alzheimer’s Disease).
Data Visualization and Prep for Modelling in SKLearn
data_prepped %>%
ggplot(aes(baseline_diagnosis)) +
geom_bar() +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.3) +
theme_minimal()
Due to the small number of participants diagnosed with Alzheimer’s in the data, a StratifiedShuffleSplit will be used to split the data into training and testing sets.
Reading and Splitting Data Read Into Pandas
pandas_data_prepped = pd.read_csv('/Users/liamf/OneDrive/Documents/DataSci2/Capstone Project/data/data_prepped.csv')
Checking Data Types
print(pandas_data_prepped.iloc[:, :8].dtypes)
## RID int64
## baseline_diagnosis object
## study_enrolled object
## VISCODE2 object
## 11715100_at float64
## 11715101_s_at float64
## 11715102_x_at float64
## 11715103_x_at float64
## dtype: object
# Performing a Stratified Shuffle Split on the diagnosis variable.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
for train_index, test_index in split.split(pandas_data_prepped, pandas_data_prepped["baseline_diagnosis"]):
strat_train_set = pandas_data_prepped.loc[train_index]
strat_test_set = pandas_data_prepped.loc[test_index]
print(strat_test_set["baseline_diagnosis"].value_counts())
## MCI 110
## CN 65
## AD 11
## Name: baseline_diagnosis, dtype: int64
strat_train_set.to_csv('/Users/liamf/OneDrive/Documents/DataSci2/Capstone Project/data/train_set.csv', index=False)
strat_test_set.to_csv('/Users/liamf/OneDrive/Documents/DataSci2/Capstone Project/data/test_set.csv', index=False)
Data Visualization
# Pairplot of a few gene expression columns.
pair1 = sns.pairplot(data=strat_train_set.iloc[:, 5:10])
plt.show()
plt.clf()
# Pairplot with diagnosis differentiation.
pair2 = sns.pairplot(data=strat_train_set.iloc[:, [1, 5, 6, 7, 8, 9, 10]], hue="baseline_diagnosis")
plt.show()
plt.clf()
# Data prep for correlation heatmap.
df_sub = strat_train_set.iloc[:, 4:].sample(n=10, axis='columns', random_state=42)
cor = df_sub.corr()
# Triangle correlation heatmap: mask the upper triangle.
mask = np.triu(np.ones_like(cor, dtype=bool))
heatmap = sns.heatmap(cor, mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')
plt.show()
plt.clf()
Considerations for Part 2 (Modelling)
- The number of columns far exceeds the number of rows, which will have to be accounted for when choosing models.
- Variable selection is key, as 49,390 variables is far too many for 744 rows. PCA, chi-square tests, and models with built-in variable selection are all options.
- Overfitting will also be a major concern. The models that are trained will have to account for this, and the hyperparameters will likely need to apply heavy regularization.
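As a rough sketch of the chi-square-based variable selection plus regularization mentioned above, on synthetic stand-in data with illustrative parameter choices (k, C):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
X = rng.uniform(1.5, 4.5, size=(120, 500))   # stand-in for log-scale expression values
y = rng.choice(["CN", "MCI", "AD"], size=120)

# chi2 requires non-negative features (satisfied here); a small C means a
# strong L2 penalty, which helps when columns far outnumber rows.
model = make_pipeline(SelectKBest(chi2, k=50),
                      LogisticRegression(C=0.1, max_iter=1000))
model.fit(X, y)
print(model.predict(X[:3]))
```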